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Abstract — The knowledge of end-to-end network distances is 
essential to many Internet applications. As active probing of all 
pairwise distances is infeasible in large-scale networks, a natural 
idea is to measure a few pairs and to predict the other ones 
without actually measuring them. This paper formulates the dis- 
tance prediction problem as matrix completion where unknown 
entries of an incomplete matrix of pairwise distances are to be 
predicted. The problem is solvable because strong correlations 
among network distances exist and cause the constructed distance 
matrix to be low rank. The new formulation circumvents the well- 
known drawbacks of existing approaches based on Euclidean 
embedding. 

A new algorithm, so-called Decentralized Matrix Factorization 
by Stochastic Gradient Descent (DMFSGD), is proposed to solve 
the network distance prediction problem. By letting network 
nodes exchange messages with each other, the algorithm is 
fully decentralized and only requires each node to collect and 
to process local measurements, with neither explicit matrix 
constructions nor special nodes such as landmarks and central 
servers. In addition, we compared comprehensively matrix factor- 
ization and Euclidean embedding to demonstrate the suitability 
of the former on network distance prediction. We further studied 
the incorporation of a robust loss function and of non-negativity 
constraints. Extensive experiments on various publicly-available 
datasets of network delays show not only the scalability and the 
accuracy of our approach but also its usability in real Internet 
applications. 

Index Terms — network distance prediction, matrix completion, 
matrix factorization, stochastic gradient descent. 



I. Introduction 

On large networks such as the Internet, many applications 
require the knowledge of end-to-end network distances in 
order to achieve Quality of Service (QoS) objectives. In the 
networking field, the distance between two network nodes is 
defined by the delay or latency between them, in the form 
of either one-way delay or more often round-trip time (RTT). 
Examples include peer-to-peer file sharing, overlay networks, 
and content distribution systems where peers preferably access 
nodes or servers that are likely to respond fast |1]-|7|. 

Clearly, it is infeasible to probe actively end-to-end dis- 
tances among all pairs of nodes in large networks as the 
demand in terms of measurements grows quadratically with the 
scale of the network. A natural idea is to probe a small set of 
pairs and then predict the distances between other pairs where 
there are no direct measurements. This understanding has 
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motivated numerous research on Network Coordinate System 
(NCS) |[8)-p2). For instance, approaches based on Euclidean 
embedding have been widely studied and achieved good 
performance in interesting scenarios (9|, fl3) . Realizing that 
the assumption of Euclidean distance properties (symmetry 
and triangle inequality) are often violated in practice, as 
observed in various studies |9|, [14|-[18|, matrix factorization 
has recently drawn increasing attention of the networking 
community (10), fPT) . 

In this paper, we investigate matrix factorization for network 
distance prediction. In particular, we formulate the problem of 
network distance prediction as a matrix completion problem 
where a partially observed matrix is to be completed (19)- 
(21). Here, the matrix contains distance measurements such 
as RTTs between network nodes with some of them known 
and the others unknown thus to be filled. Matrix completion 
is only possible if matrix entries are largely correlated, which 
certainly holds for network distances because Internet paths 
with nearby end nodes often overlap and share common 
bottleneck links. These redundancies among network paths 
cause the constructed distance matrix to be low rank, which 
will be empirically demonstrated for various RTT datasets. 

Although numerous approaches to matrix completion have 
been proposed, many of which are based on low-rank matrix 
factorization (22) -p4), very few are directly applicable to 
network applications where decentralized processing of data is 
most of the time a necessity. In this paper, we propose a fully 
decentralized algorithm based on Stochastic Gradient Descent 
(SGD), which is founded on the stochastic optimization theory 
with nice convergence guarantees (25) . 

The so-called Decentralized Matrix Factorization by 
Stochastic Gradient Descent (DMFSGD) algorithm has two 
distinct features. First, it requires neither explicit constructions 
of matrices nor special nodes such as landmarks and central 
servers where measurements are collected and processed. In- 
stead, by letting network nodes exchange messages with each 
other, matrix factorization is collaboratively and iteratively 
achieved at all nodes, with each node equally retrieving a num- 
ber of distance measurements. Second, the algorithm is simple, 
with no infrastructure, and is computationally lightweight, 
containing only vector operations. These features make it 
suitable for dealing with practical problems, when deployed in 
real applications, such as measurement dynamics where net- 
work measurements vary largely over time and network churn 
where nodes join and leave a network frequently. Extensive 
experiments on various publicly-available RTT datasets show 
not only the scalability and the accuracy of our approach but 
also its usability in real Internet applications. 
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Our preliminary work on decentralized matrix factorization 
for network distance prediction was published in fTT) . In the 
present paper, we make the following distinct contributions: 

• Our previous approach in (fTT) was based on Alternating 
Least Squares (ALS), requiring each node to probe all 
local measurements simultaneously. In contrast, the new 
SGD-based approach allows each node to probe one 
measurement at a time, making the system more flexible. 
The new approach also addresses several issues arising 
when applied practically, including the difficult choice 
of the learning rate parameter and the consideration of 
passive distance acquisitions and dynamic measurements. 

• We compare comprehensively matrix factorization and 
Euclidean embedding to reveal the suitability of matrix 
factorization. A unified view is provided which leads to 
a unified optimization framework to solve both of them. 

• Two extensions of the current matrix factorization model 
are proposed, including the incorporation of a robust loss 
function and the introduction of constraints in the model 
to ensure the nonnegativity of the predicted distances. 
These extensions are found helpful in improving the 
accuracy of the prediction and require little modification 
to the algorithm with no additional computational cost. 

• In addition, more extensive evaluations have been carried 
out to study not only the impact of the parameters but also 
the accuracy of our approach. In particular, we highlight 
the usability of our approach by simulations on real 
dynamic data collected from a peer-to-peer file sharing 
application, namely Azureus (2), (T3J. 

The rest of the paper is organized as follows. Section [H] 
summarizes the related work on network distance prediction 
based on Euclidean embedding and matrix factorization. Sec- 
tion III introduces the formulation of network performance 



prediction as matrix completion and its resolution by low-rank 
matrix factorization. Section |IV] describes the decentralized 
matrix factorization algorithm based on Stochastic Gradient 
Descent. Section[V]discusses possible extensions to the current 
matrix factorization model. Section VI evaluates our approach 
on various publicly available datasets of RTT Conclusions and 



future work are given in Section VII 



II. Related Work 

Among numerous works on network distance prediction, 
we only discuss and compare approaches based on Euclidean 
embedding and on matrix factorization due to their simplicity 
and generality. We refer the interested readers to fl2) for a 
more detailed review of this field. 

A. Euclidean Embedding 

A straightforward approach to network distance prediction is 
to embed network nodes into a metric space where each node 
is assigned a coordinate from which distances can be directly 
computed. Two representatives are Globe Network Positioning 
(GNP) (8) and Vivaldi (9). 

GNP firstly proposed the idea of network embedding that 
relies on a small number of landmarks. Based on inter- 
landmark distance measurements, the landmarks are first em- 
bedded into a metric space such as Euclidean or spherical 



coordinate systems. Then, the ordinary nodes calculate their 
coordinates with respect to the landmarks. Vivaldi extended 
GNP in a decentralized manner by eliminating the landmarks. 
It simulates the network by a physical system of spring and 
minimizes its energy according to Hooke's law to find an 
optimal embedding. 

In all metric spaces, distances undergo two important prop- 
erties: 

. Symmetry: d(A, B) = d(B, A); 

. Triangle Inequality: d(A, B) + d{B, C) ^ d{A, C). 

However, network distances are not necessarily symmetric 
especially when represented by one-way delays [26 1, [27]. The 
bigger issue is the property of triangle inequality. Many studies 
have shown that the violations of triangle inequality (TIV) are 
widespread and persistent in current Internet (9), p4)-p8). 
In the presence of TIVs, metric space embedding shrinks the 
long edges and stretches the short ones, degrading heavily the 
accuracy of the embedding. Figure [TJ illustrates the idea of 
Euclidean embedding for network distance prediction and the 
impact of TIVs on the accuracy. 

Without loss of generality, we focus on the simplest metric 
space, namely Euclidean coordinate systems, in the rest of the 
paper. 



B. Matrix Factorization 

Alternatively, matrix factorization has also been used for 
network distance prediction (see Figure [2] for an illustration). 
The biggest advantage of matrix factorization is that it makes 
no assumption of Euclidean distance properties and thus can 
tolerate the widespread TIVs and the possible asymmetry in 
network distance spaces. 

The first system based on matrix factorization was Inter- 
net Distance Estimation Service (IDES) [10] which has the 
same landmark-based architecture as GNP. IDES factorizes a 
small but full inter-landmark distance matrix, at a so-called 
information server, by using Singular Value Decomposition 
(SVD). Similarly, Phoenix treated the early-entered nodes 
as landmarks and allowed an ordinary node to select any 
existing nodes in the system which already have coordinates 
assigned |28|. Landmark-based systems suffers from several 
drawbacks including single-point failures, landmark overloads, 
and potential security problems. The selection of landmarks 
can also affect the accuracy of the prediction. Moreover, in 



Section III-D we will show that landmark-based approaches 
are actually special cases of a general decentralized matrix 
factorization model and thus can also be solved by our 
approach. 

III. Network Distance Prediction by Matrix 
Factorization 

This section formulates the problem of network distance 
prediction as matrix completion and describes its resolution 
by matrix factorization. We also provide a unified view of dif- 
ferent approaches to network distance prediction, the insights 
of which lead to a unified optimization framework. 
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Fig. 1. Network distance prediction by Euclidean Embedding. 
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Fig. 2. Network distance prediction by matrix factorization. Note that the diagonal entries of D and D are empty. 



A. Problem Formulation 

Assuming n nodes in the network, a n x n distance matrix 
is constructed with some distances between nodes measured 
and the others unmeasured. Denote D the measured distance 
matrix with the measured distance from node i to node 
j and D the predicted distance matrix with dij the predicted 
distance computed from some function. 

Given the above notations, network distance prediction can 
be viewed as a matrix completion problem that estimates 
the missing entries in D from a small number of known 
entries pO) . Its resolution generally amounts to minimizing 
a loss function of the following form 



L(D,D,W)= 



(i) 



where W is a weight matrix with w tJ taking values between 



and 1. In a simple case, 



= 1 if da is measured 



and otherwise. Note that if the distance measurements are 
RTTs, then dji = d^ as RTTs are approximately symmetric. 
Consequently, Wji — Wij as dji and d^ are either both known 
or both unknown. 

I is a loss function that penalizes the difference between an 
estimate and its desired or true value. The most commonly- 
used loss function is the L 2 or square loss function, 



l(d,d) = (d-d) 2 . 
We will discuss other loss functions in Section [V] 



(2) 



B. Low-Rank Approximation and Matrix Factorization 

Additional constraints are needed to solve the matrix com- 
pletion problem in eq. [T] A common approach is to constrain 
the rank of the approximate matrix D so that 



Rank(D) = r, 



(3) 



where r <§; n for D of size n x n 

The assumption in this low-rank approximation is that the 
entries of D are largely correlated, which causes D to have 
a low effective rank. To show that it holds for our problem, 
Figure [3] plots the singular values of two RTT matrices. It 
can be seen that the singular values of both matrices decrease 
fast as the 10th singular values are 5.7% and 2.9% of the 
largest ones respectively, indicating strong correlations in 
them. The low-rank nature of many other RTT datasets have 
been previously reported in f29) . 

Directly finding D by minimizing eq. [T] subject to eq. [3] is 
considerably difficult due to the rank constraint. However, as 
D is of low rank, we can factorize it into the product of two 
smaller matrices, i.e., 



D = XY 1 



(4) 



where X and Y are of size n x r. Therefore, we can get rid 
of the rank constraint by replacing D by XY T in eq. Ill and 
then look for X and Y instead by minimizing 

n 

L(D, X, Y,W)=J2 mAdij,Xiyj), (5) 

where Xi is the zth row of X, is the ith row of Y, and 
XiyJ = d^ is the estimate of dy. Note that the factorization 
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Fig. 4. Architectures of landmark-based, the left plot, and decentralized, the 
right plot, systems for network distance prediction. The squares are landmarks 
and the circles are ordinary nodes. The directed path from node i to node j 
means that node i probes node j and therefore Wij = 1. 



Fig. 3. The singular values of a RTT matrix of 2255 X 2255, extracted from 
the Meridian dataset 1 30 1 and called "Meridian2255", and of a RTT matrix of 
525 X 525, extracted from the P2psim dataset 1 30] and called "P2psim525". 
The singular values are normalized so that the largest singular values of both 
matrices are equal to 1. 



of X and Y which produce the same D, the incorporation 
of the regularization will force to choose the pair with the 
smallest norm. 



in eq. |4] has no unique solution as 

D = XY T = XGG~ 1 Y T , (6) 

where G is any arbitrary rxr invertible matrix. Thus, replacing 
X by XG and Y T by G~ X Y T results in the same D. 

Generally, the class of techniques to solve the low-rank 
approximation is matrix factorization. When D is complete, 
analytic solutions can be found by using singular value de- 
composition (SVD) pT) . With missing entries, the factoriza- 
tion is usually done by iterative optimization methods such 
as Gradient Descent or Newton algorithms (32). Note that 
additional constraints can be imposed in eq.[5] For instance, the 
entries of X and Y can be required to be nonnegative in order 
to recover a nonnegative matrix, leading to the nonnegative 
matrix factorization (NMF) [33 1. 

C. Incorporation of the Regularization 

Matrix completion by matrix factorization suffers from a 
well-known problem called overfitting in the field of machine 
learning (34) . In words, directly optimizing eq. [5] often leads 
to a "perfect" model with no or small errors on the training 
data while having large errors on the unseen data which are 
not used in learning. The problem is more severe when D is 
sparse or when r is large. 

A common way to avoid overfitting is through regularization 
that penalizes the norms of the solutions, resulting in the 
following regularized loss function, 

L(D,X,Y,W,\)= (7) 

n n n 

i.j — l i—1 i—1 

where A is the regularization coefficient that controls the extent 
of regularization. 

Besides avoiding overfitting, the regularization also helps 
overcome the drift of the solutions due to the non-uniqueness 
of the factorization (see eq. [6), which often leads to the 
overflows of the solutions. Among the infinite number of pairs 



D. A Unified View of Approaches to Network Distance Pre- 
diction 

Although popular approaches to network distance predic- 
tion vary by adopting various models including Euclidean 
embedding and matrix factorization and by adopting different 
architectures of either landmark-based or landmark-less and 
thus decentralized, these seemingly different approaches all 
optimize the same function in eq. [T] but differ only in the 
setting of w%j and in the associated distance functions to 
calculate dij, 

• Setting of ro,j : For landmark-based methods, as all paths 
between landmarks are measured and ordinary nodes 
probe only the landmarks, 

{1 if node j is a landmark 
otherwise 

For decentralized methods, as each node equally probes 
a number of nodes, 

{1 if node i probes node j 
otherwise 

Figure |4] illustrates the architectures of landmark-based 
and decentralized systems. 

• Distance functions to calculate dif For matrix factoriza- 
tion, as described above, 

dij = Xi~yJ, (8) 

For Euclidean embedding, the Euclidean distance is de- 
fined as 

dij = yj (xi - Xj) T (xi - Xj), (9) 

where Xi and Xj are the Euclidean coordinates of node i 
and node j. 

The above insights suggest a unified framework to treat and 
to solve equally network distance prediction under different 
models and different architectures. For instance, the decentral- 
ized matrix factorization algorithms proposed in the following 
sections can be used to solve both Euclidean embedding and 
landmark-based systems with little modification. 
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IV. Decentralized Matrix Factorization for 
Network Distance Prediction 

The goal is to find X and Y such that XY T best approxi- 
mates D by minimizing Eq. [JJ Commonly, it requires to collect 
and to process a number of distance measurements at a central 
node, which is a major obstacle to network applications. 
Below, we introduce algorithms to minimize eq. [7] in a fully 
decentralized manner. 



A. Problem Formulation 

A decentralized resolution of eq. [7] forbids the explicit 
constructions of large matrices. Thus, each row of X and of 
Y, denoted by Xi and yi and called the x and y coordinates of 
node i in the sequel, are stored distributively at each node. To 
calculate Xi and y t locally, we let node i probe and exchange 
messages with a number of nodes in the network, called 
neighbors in the sequel, and optimize the following losses 



k = "''Mil, j ■■'■,!/' I + ax { x 

3=1 
n 



T 

i ' 



T 

i ' 



(10) 



(11) 



where Wj represents the neighboring relationship of node j 
to node i, i.e., Wj — 1 if node j is a neighbor of node i or 
equivalently if node i probes node j, and otherwise. Note 
that as Wj is not necessarily equal to w J i7 measurements dij 
and dji may only be known by one node of either node i or 
node j but not by the other. 

Essentially, U is the regularized loss of the edges from node 
i to other nodes and l % is that of the edges from other nodes to 
node i. Thus, we address the large-scale optimization problem 
in eq. 17] by decomposing it into a number of subproblems in 
eqs. [TO] and 1 1 which can be solved locally at each node by 
using only local measurements. 



In seeing that eqs. 10 and 11 are standard least-squares 
problems where analytical solutions exist, our previous work 
in fTT) solved the matrix factorization problem by Alternat- 
ing Least Squares (ALS), which alternatively and iteratively 



solves the small least-squares problems in eqs. 10 and 11 



While the ALS-based algorithm performed well in simula- 
tions on datasets containing static measurements, it requires 
each node to probe measurements with a number of nodes 
simultaneously, which is impractical when deployed in real 
applications. Below, we propose a different algorithm based 
on Stochastic Gradient Descent (SGD) that processes, at each 
node, measurements one by one and one at a time. 

B. Stochastic Gradient Descent (SGD) 

SGD is a variation of traditional Batch Gradient Descent 
which is often used for online machine learning (25). Instead 
of collecting all training samples beforehand and computing 
the gradients over them, each iteration of SGD chooses one 
training sample at random and updates the parameters being 
estimated along the negative gradients computed over that 
chosen sample. SGD is particularly appropriate for network 



applications, as measurements can be acquired on demand and 
processed locally at each node. It also has simple update rules 
that involve only vector operations and is able to deal with 
large-scale dynamic measurements. 

1 ) Stochastic Updates: When using SGD, each node probes 
one neighbor at a time, measures its distance with respect to 
that node and retrieves that node's coordinates. Let node j be 
the chosen neighbor by node i at the current time. Then, the 
regularized losses that node i seeks to reduce with respect to 
node j are 

l(d lj ,x i yj) + \x l x i J ', (12) 

(13) 



Iji = l{dji,xjyj) + Xyiyi T 
The gradients of Uj and Iji are 

dxi 
dljj 
dyi 



dl(dij,Xiyf) 

h AXi, 

OX l 

dl(dji,Xjy?) 



Ay, 



In particular, the gradients of the L2 loss function are 

8l_ 

dxi 



-(dij - XiyJ)yj, 
^7 — — Xjy { )xj. 



dl 



(14) 
(15) 

(16) 
(17) 



Note that we dropped the factor 2 from the derivatives of 
the regularization terms and of the L2 loss function for 
mathematical convenience. 

Then, node i updates its coordinates along the negative 
gradient directions, given by 

Xi = (1 - r)X)Xi + r)(dij - x l yj)y j , (18) 
y l = (1 - r)\)yi + r](dji - x yj)xj, (19) 

where 77, called learning rate or step size, controls the speed 
of the updates. 

2) Minibatch and Line Search: The SGD algorithm is 
sensitive to the learning rate 77, where a too large -q results in 
large steps of updates and may overflow the solution, whereas 
a too small 77 makes the convergence slow. This sensitivity can 
be relieved by using more training samples at the same time, 
leading to minibatch SGD with the following update rules 



(1 - ?7A)a; i + 77 w} (djj - Xiyj)yj 



(20) 



x J yf)x j . (21) 



Vi = (1 - V^)Vi + V X! w ) ( d 
j=l 

To completely get rid of 77, a line search strategy can be 
incorporated to determine 77 adaptively (35). In particular, in 
each update, 77 starts with a large initial value and is gradually 
decreased until the losses in eqs 



10 



or 



1 1 are reduced after the 
update. The line search algorithm for updating Xi is given in 
Algorithm [T] The same algorithm can be used for updating yi 
by replacing eq. [I0]by eq. [TT]and eq. |20]by eq. |2T] Note that 
S in Line 6 is a small positive constant that helps overcome 
poor local optimums. We will demonstrate the effectiveness 
of adapting 7/ by line search using Algorithm [T] in Section VI 
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Algorithm 1 Line Search (for updating Xi) 



Algorithm 2 DMFSGD(z, j) 



10 



compute q by eq. 
initialize r\ with a large value; 
for i = 1 to max N umber LineS ear ch do 
compute Xi by eq.j20j 
compute li by eq. |10| 
if k <lf + 5 then 

return 
end if 

end for 



C. Neighbor Decay and Neighbor Selection 

As mentioned earlier, it is preferable to have a system that 
probes and processes measurements one by one. Thus, we let 
each node maintain the information (distance measurements 
and coordinates) of its neighbors, i.e., the nodes with which 
it communicates. In minibatch SGD, each node probes one 
neighbor at a time but updates its coordinates with respect to 
all neighbors in the neighbor set using their recorded historical 
information. 

A neighbor decay strategy is incorporated that scales the 
weight of each node in the neighbor set by its age so that 
older information receives less weight, i.e., 



GNcighborSct(i) 



a o)' 



(22) 



where a 3 is the age of the information of node j and a max is 
the age of the oldest information in the neighbor set. Note that 
this neighbor decay strategy was firstly proposed by p3) to 
overcome the problem of skewed neighbor update in Vivaldi. 
In words, some nodes may be probed at far greater frequency 
than others due simply to their longer life cycles and a direct 
consequence is that the optimization will become skewed 
toward these nodes. 

Conventionally, the neighbors of a node are selected ran- 
domly and the distances between a node and its neighbors are 
probed by active measurements [9|. However, in practice, it 
is more attractive to perform the updates of the coordinates 
passively without generating any extra traffic. In some appli- 
cations such as Azureus, passivity is enforced, as we have 
no control over the selection of neighbors with which a node 
communicates and when it communicates with them fT3) . 

Therefore, we differentiate the situations where distances 
are probed by active and passive measurements. For the 
former, the conventional random neighbor selection procedure 
is adopted, i.e., each node randomly selects k nodes as its 
neighbors and actively probes one of them from time to time. 
For the latter, no neighbor selection is performed explicitly 
and each node maintains a relatively small set of active 
neighbors with which it recently communicated and updates its 
coordinates whenever a new measurement is made available. 
Note that this difference has no impact on the update rules in 



node i retrieves dij,dji,: 



Uj actively or passively; 



node i updates the weights of its neighbors by eq. 22 
update Xi by eq. [20] with r\ set by line search; 
update fji by eq. |21| with rj set by fine search; 



D. Algorithm 

We denote the SGD-based decentralized matrix factoriza- 
tion algorithm as DMFSGD, given in Algorithm [2] Like 
Vivaldi (9), our DMFSGD algorithm has no infrastructure and 
employs the same process at all nodes. It is simple, with update 
rules containing only vector operations. 

In the implementation, the coordinates of each node are 
initialized with random numbers uniformly distributed be- 
tween and 1. Empirically, the algorithm is insensitive to 
the random initializations of the coordinates. We would like 
to point out that the algorithm is one of those randomized 
gossip algorithms where each node exchanges messages with 
a number of other nodes randomly (36). 

As mentioned earlier, the algorithm is generic and can also 
deal with landmark-based architectures, by letting each node 
only select landmarks as its neighbors, and with Euclidean 
embedding, by adopting the Euclidean distance defined as 
eq. [9] when optimizing eq. [T] which leads to the update rule 
of Vivaldi [9], given by 



dx. 



Xi + r](dij - dij)'- 



Note that Vivaldi adopted the L 2 loss function in eq. [T] with 
no regularization incorporated, and the learning rate m termed 
differently as timestep, was adapted by taking into account 
some confidence measure of each node to its coordinate. Thus, 
Vivaldi can be viewed as a SGD-based decentralized Euclidean 
embedding algorithm, instead of the simulation of a spring 
system in (5J. 

V. Extended Matrix Factorization Models 

This section discusses possible ways to extend the common 
matrix factorization model. 

A. Robust Matrix Factorization 

The widely-used L 2 loss function is known to be sensitive 
to outliers which often occur in network measurements due 
to network anomaly such as sudden traffic bursts and attacks 
from malicious nodes. Other loss functions such as the L\ 
loss function, the e-insensitive loss function and the Huber 
loss function are more robust and can tolerate outliers [37] , 
|38|. For example, the L\ loss function is defined as 



l{d,d) = \d-d\. 



(23) 



eqs 18 and 19 or in eqs 20 and 21 



Thus, we can potentially enhance the robustness of matrix 
factorization by replacing the L 2 loss function by e.g. the L\ 
loss function, and the same SGD procedure can be applied to 
solve the robust matrix factorization problem. Note that the 
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L\ loss function is non-differentiable and the gradients have 
to be approximated by the subgradients [j] given by 

dl 

— — = -sign(dy - Xiyj)yj, (24) 
dl 



Algorithm 3 Extended_DMFSGD(z, j) 



-sign(d i i - )xj 



(25) 



Replacing the gradient functions of eqs. [14] and [15] by eqs. [24] 



and 25 the update rules of minibatch SGD become 

n 

Xi = (1 - r)\)xi + T) sign(dij - x i yj)w l j y j , 



(26) 



n 

2/i = (1 - + r / E si S n (4 



3» 



)n 'r v ,■ (27) 



Comparing eqs. 26 and 27 with eqs. 20 and 21 the only 
difference is that for the L 2 loss function, the magnitudes of 
the updates are proportional to the fitting errors (d — xy T ), 
whereas for the L\ loss function, only the signs of the fitting 
errors are taken into consideration and decide the directions 
of the updates. 



node i retrieves dij,dji,Xj,yj actively or passively; 



node i updates the weights of its neighbors by eq. 22 
if use L 2 loss function then 

update Xi by eq. [20] with rj set by line search; 

update yi by eq. ^Hwith r\ set by line search; 
else // use L\ loss function 

update Xi by eq.]26]with r\ set by line search; 

update yi by eq. |27| with 77 set by line search; 
end if 

if force nonnegativity then 

turn the negative entries in Xi and yi into 0; 
end if 



As distances are defined as in eq. 29 a systematic solution 
is to factorize D by optimizing 

n n n n 

Y Y Wi i 1 ^ < $3 ) + A Y XiX i + x Y • (30) 

i— 1 j — 1 i—1 i—1 

Similar SGD update rules can be derived. 



B. NonNegativity Constraint 

Conventional matrix factorization techniques do not pre- 
serve the nonnegativity of the distances. Empirically, only 
a very small portion of the predicted distances were found 
negative by our DMFSGD algorithm, and a direct solution is 
to turn dij into a small positive value if dij — XiyJ < 0. 

A systematic solution is to incorporate the nonnegativity 
constraint in matrix factorization, leading to the nonnegative 
matrix factorization (NMF) that optimizes 



Y w l0 l{di j ,x i y 1 :) + A 



iY x * x T 

i=l 



+ ^Y yiy ^ 



(28) 



subject to Xi ^ 0, yi ^ 0, 



»=i 
i = 1, 



The optimization of NMF is not fundamentally different from 
that of the unconstrained matrix factorization, adding only one 
projection step that turns the negative entries in Xi and j/, into 
zero after each SGD update which causes no noticeable impact 
on the speed of the algorithm. The technique is also known 
as projected gradient descent |39j. 

Note that the nonnegativity constraint has been previously 
studied in pO] , |28|, both of which adopted a more heavy- 
weight nonnegative least-squares solver. 

C. Symmetric Distance Matrix Factorization 

Also note that network distances are symmetric if repre- 
sented by RTT and that this symmetry is not preserved either. 
A direct solution is to turn the predicted distances symmetric 
by defining a symmetric distance function as 



3m _ d *3 + 

% - 2 



X 3Vt 



(29) 



'Analogously, the subgradient-based technique that optimizes non- 
differentiable functions is called subgradient descent |35| . Following the 
convention in [25 ] , we use the term SGD to refer to both Stochastic Gradient 
and SubGradient Descent. 



D. Height Model 

The height model in Vivaldi [9] can also be incorporated. 
This model augments the x and y coordinates of a node 
with a height. Similarly, the x and y coordinates model the 
high-speed Internet core, while the height models the time 
packets take to travel the access link from the node to the 
Internet core. The cause of the access link distance includes 
queuing delay and low bandwidth [9|. The height augmented 
symmetric distance is defined as 



a ij - 2 



T 

x iVi 



hj. 



(31) 



Correspondingly, the loss function to be optimized becomes 

n n 

; W )i{d l3 , + \Y + *Y Mi- w 



EE'' 



i=i 



i=l j=l 

Similar SGD update rules can be derived. 



i=l 



E. Extended DMFSGD Algorithm 

Empirically, we found no or little improvements by incorpo- 
rating the symmetric or height-augmented symmetric distance 
function in eq. [29] or [5T| thus include neither of them in 
our system. However, the incorporation of the nonnegativity 
constraint and the robust loss function not only improved 
the accuracy but also made the results more stable and less 
sensitive to parameter settings, which will be demonstrated 
in Section VI] The extended DMFSGD algorithm is given in 
Algorithm 3] Note that as the basic version in Algorithm [2] is 
a special case of the extended version in Algorithm [3] we will 
refer to Algorithm [3] simply as DMFSGD in the sequel. 

VI. Experiments and Evaluations 

In this section, we evaluate our DMFSGD algorithm^] and 
compare it with state-of-the-art approaches. 

2 The source code of the algorithm will be publicly available soon. 
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A. Evaluation Methodology 

The evaluations were performed under the following criteria 
and on the following datasets. 
1) Evaluation Criteria: 

• Cumulative Distribution of Relative Estimation Error 

Relative Estimation Error (REE) is defined as 



REE = 



\dij- di 



TABLE I 
Properties of The Datasets 



Dataset 


Nodes 


Symmetry 


TIV percentage 


Dynamic 


Harvard226 


226 


/ 


/ 


Yes 


P2PSiml740 


1740 


Yes 


85.53% 


No 


Meridian2500 


2500 


Yes 


96.55% 


No 


P2PSim525 


525 


Yes 


76.17% 


No 


Meridian2255 


2255 


Yes 


96.25% 


No 


SyntheticlOOO 


1000 


Yes 


No 


No 



Stress Stress measures the overall fitness and is used to 
illustrate the convergence of the algorithm, defined as 



stress 



Median Absolute Error Median Absolute Error (MAE) 
is defined as 



MAE = mediariij (\di 



dij\). 



2) Datasets: 

• Harvard226 contains dynamic and passive measure- 
ments of application-level RTTs, with timestamps, be- 
tween 226 Azureus clients collected in 4 hours p3| . 

. P2PSiml740 was obtained from the P2PSim project that 
contains static RTT measurements between 1740 Internet 
DNS servers (40), (41). 

• Meridian2500 was obtained from the Cornell Meridian 
project that contains static RTT measurements between 
2500 nodes (30). 

• P2PSim525 is a complete submatrix between 525 nodes 
derived from P2psiml740. 

• Meridian2255 is a complete submatrix between 2255 
nodes derived from Meridian2500. 

• SyntheticlOOO contains the pairwise distances between 
1000 nodes that are randomly generated in a 10- 
dimensional Euclidean space. 

The first five datasets were obtained from real-world networks 
and contain a large percentage of TIV edges, whereas the last 
one was synthesized and is TIV free. Here, an edge AB is 
claimed to be a TIV if there exists a triangle AABC where 
AB > BC + AC. The last three datasets were only used 
in section VI-B for the purpose of comparing the models of 
Euclidean embedding and matrix factorization. 

Table U summarizes these datasets. Note that we can neither 
tell the symmetry nor calculate the TIV percentage of the 
Harvard226 dataset, as the measurements between network 
nodes vary over time largely, sometimes in several orders of 
magnitudes. The Harvard226 dataset is rather dense with about 
3.9% pairwise paths unmeasured in 4 hours. The other paths 
are measured in uneven frequencies with one measured the 
maximal 662 times and one the minimal 2 times. About 94.0% 
of the paths are measured between 40 and 60 times. 

3) Implementations for Different Datasets: As mentioned 
earlier, the DMFSGD algorithm adopts the conventional ran- 
dom neighbor selection procedure in the scenarios where mea- 
surements are probed actively and maintains dynamically an 



active neighbor set for each node in the scenarios where mea- 
surements are obtained passively. Thus, for the Harvard226 
dataset, we let each node maintain an active neighbor set 
containing the nodes it has contacted within the past 30 
minutes and the timestamped measurements are processed 
in time order. For the other datasets, the random neighbor 
selection is used and the measurements are processed in 
random order with no neighbor decay (Line 2 in Algorithm [3]) 
as they are static. 

To handle the dynamics of the measurements in Harvard226, 
the distance filter in (13) is adopted that smooths the streams 
of measurements within a moving time window, 30 minutes 
in this paper, by a median filter. In the evaluation, we built a 
static distance matrix by extracting the median values of the 
streams of measurements between each pair of nodes and used 
it as the ground truth. 

B. Euclidean Embedding vs. Matrix Factorization 

Euclidean embedding and matrix factorization both solve 
the same problem in eq. [T]but subject to different constraints. 
Euclidean embedding requires D to be symmetric and to 
satisfy the triangle inequality, whereas matrix factorization 
only requires D to be low rank. Below, we compare em- 
pirically Euclidean embedding and matrix factorization to 
show whether this difference in constraints makes matrix 
factorization more suitable for network distance prediction. 

1) Algorithms: To make the model comparison fair, we 
chose the state-of-the-art algorithms to solve the Euclidean 
embedding and matrix factorization problems so that both 
are solved to their limits. For Euclidean embedding, Multi- 
Dimensional Scaling (MDS) is the most popular technique 
that searches the optimal embedding using an iterative algo- 
rithm. We adopted the MDS implementation, mdscale, in the 
statistical toolbox of matlab |42|. 

For matrix factorization, SVD provides the analytic solution 
which is globally optimal (31) . Generally, SVD factorizes a 
given matrix D into three matrices of the form 

D = USV T , 

where U and V are unitary matrices, and S is a diagonal 
matrix with nonnegative real numbers on the diagonal. The 
positive diagonal entries are called the singular values and 
their number is equal to the rank of D. 

To obtain a low-rank factorization, we keep only the r large 
singular values in S and replace the other small ones by zero. 
Let S r be the new S, X = US? and Y T = S?V T , where 
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Fig. 5. Comparison of MDS-based Euclidean embedding and SVD-based matrix factorization on synthetic 1000, P2psim525 and Meridian2255. The stresses 
and the median absolute errors by both methods in different dimensions/ranks are shown on the first and second rows respectively. Note that a perfect 
embedding with no errors was generated for Synthetic 1000 in the 10 dimensional Euclidean space by MDS. 



TABLE II 

Matrix Factorization vs. Euclidean Embedding 





Matrix Factorization 


Euclidean Embedding 


node coordinate 


%i — (^41 ; ' ' ' i -Ezr) 
Vi = (Vil,--- ,Vir) 


Xi — (x^\ ) ' ' ' i -^ir) 


distance function 




dij — \/ (x^ X j)"^ 1 (x^ Xj ) 


constraints 


low rank 


Symmetry: dij = dji 
Triangle Inequality: dij < d^ + d^j 



S r (i,i)z = y/ S r (i, i). Then, D = XY T is the optimal low- 
rank approximation to D. 

2) Evaluations: Since SVD cannot handle missing data, we 
only compare MDS and SVD on the three complete datasets 
including Synthetic 1000, P2psim525 and Meridian2255. On 
each dataset, we ran MDS and SVD in different dimensions 
and ranks and computed the stresses and MAE, shown in 
figure [5] It can be seen that the accuracies by SVD mono- 
tonically improve on all three datasets as the rank increases, 
whereas consistent improvements by MDS are only found 
on SyntheticlOOO which is TlV-free. On P2psim525 and 
Meridian2255 where severe TIVs exist, MDS achieves no or 
little visible improvement after 10 dimensions. 

These evaluations demonstrate the influences of different 
constraints imposed on the two techniques. For Euclidean em- 
bedding, the symmetry constraint doesn't cause any problem 
as the RTTs in all datasets are symmetric. However, the con- 
straint of triangle inequality is strong and can't be relieved by 
increasing dimensions. In contrast, matrix factorization makes 
no assumptions of triangle inequality, thus is not affected by 
the TIVs in the data. Note that the accuracy improvement 
by increasing the rank is guaranteed for SVD-based matrix 
factorization. However, this conclusion cannot be extended to 



the cases where missing data is present. We will show later that 
increasing the rank beyond some value in matrix factorization 
for a large amount of missing data will not further improve 
the accuracy. 

This comparative study reveals the model advantages of 
Matrix Factorization over Euclidean embedding. Overall, Eu- 
clidean embedding has a geometric interpretation which is 
useful for visualization. However, due to the existence of TIVs 
and the possible asymmetry in network distance spaces, low- 
rank matrix factorization is more suitable for modeling the 
network distance spaces. Table [IT] compares the main features 
of matrix factorization and Euclidean embedding. 

C. Impact of Parameters 

This section discusses and demonstrates the impact of the 
parameters of our DMFSGD algorithm. 

1) k, r and X: Our DMFSGD algorithm has two main pa- 
rameters, the regularization coefficient A and the rank r. Active 
probing introduces one additional parameter, the number of 
neighbors k that are selected for each node. Intuitively, r is 
the number of unknown variables in each coordinate and k 
is the amount of known data that is used to estimate each 
unknown variable in a coordinate. A controls the extent of the 
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regularization which avoids both overfitting and drift of the 
coordinates. 

Clearly, increasing k is equivalent to adding more data 
and thus always helps improve the accuracy. However, a 
larger k also means more probe traffic and consequently 
higher overheads. On the other hand, only a certain number 
of unknown variables can be accurately calculated from a 
certain amount of known data. Thus, increasing r beyond some 
value for a fixed k will only lead to severe overfitting and 
consequently, a large A is needed to address it. 

In practice, k should be fixed according to the require- 
ments of the applications by trading off between accura- 
cies and measurement overheads. Following the suggestion 
in Vivaldi 15), we set k = 32 for P2psiml740 and for 
Meridian2500 in the rest of the paper. Note that k = 32 
makes the available measurements considerably sparse. For 
instance, 32/1740 = 1.84% measurements are available for 
each node in P2PSiml740 and 32/2500 = 1.28% for each 
node in Meridian2500. Recall that no A; is set for Harvard226. 

2) Experiments under Different Configurations: We then 
experimented with different configurations of r = {3, 10, 100} 
and A = {0.01,0.1,1,10}, with different loss functions and 
whether to incorporate the nonnegativity constraint, shown in 
Figure [6] rj is adapted by the line search, with the initial value 
of 10 - for the L2 loss function and of 10~ 2 for the L\ loss 
function. 

In particular, we made the following observations. First, 
the DMFSGD algorithm is generally more accurate when the 
robust Li loss function and the nonnegativity constraint are 
incorporated. The likely reasons are that the L\ loss function is 
insensitive to large fitting errors some of which are introduced 
by measurement outliers and that the nonnegativity constraint 
reduces the searching space which makes it easier to find 
a stable solution. Thus, the robust L\ loss function and the 
nonnegativity constraint are incorporated in the DMFSGD 
algorithm by default. 

Second, A = 1 seems to be a good choice under most 
configurations and is thus adopted by default. Third, the impact 
of r depends on the properties of the dataset. In Harvard226 
where available measurements are dense, the prediction ac- 
curacy improves monotonically with r, whereas in the other 
two datasets where available measurements are sparse due to 
the setting of a small k, better performance is achieved with 
r ^ 10 and a large A is needed to overcome the overfitting 
caused by larger r's, which confirms our analysis in the 
previous section. Thus, by trading off between the performance 
on all three datasets, r = 10 is adopted by default. 

3) r\: As mentioned earlier, SGD is sensitive to the learning 
rate r\ where a too large 77 leads to the overflow of the solutions 
and a too small r\ slows down the convergence. Although this 
sensitivity is reduced by minibatch SGD, it is still difficult to 
find an appropriate constant that works for all datasets and in 
all situations. We experimented with different constant 77's and 
with the line search to adapt r\ dynamically. Results are shown 
in Figure [7] It can be seen that the line search strategy performs 
best in terms of both accuracy and convergence speed. Note 
that the convergence speed is illustrated by the stress and 
MAE improvements with respect to the average measurement 



number per node, i.e. the total number of measurements used 
by all nodes divided by the number of node^] It can be seen 
that the DMFSGD algorithm converges fast after each node 
probe, on average, 10 x k measurements from its k neighbors. 
Although no k is set for Harvard226, we treat it as k = 226. 

4) Discussions: By incorporating the line search strategy, 
the L\ loss function and the nonnegativity constraint, our 
DMFSGD algorithm is left with two tunable parameters: the 
rank r and the regularization coefficient A. The default config- 
uration of A — 1 and r = 10 is not guaranteed to be optimal 
in different situations and on different datasets. However, 
fine tuning of parameters is difficult, if not impossible, for 
network applications due to the measurement dynamics and 
the decentralized processing where local measurements are 
processed locally at each node with no central nodes gathering 
information of the entire network. Empirically, the default 
parameter setting leads to good, though not the best, prediction 
accuracy to a large variety of data. 

The setting of k = 32 has been commonly adopted in 
many systems such as Vivaldi. However, most systems contain 
network nodes of a few thousands or less. For large systems 
of more nodes, k has to be scaled with the number of nodes 
n. According to the theory of matrix completion p5)-pl), 
one can recover an unknown nx n matrix of low rank r from 
just about 0(nr\ogn) noisy entries with an error which is 
proportional to the noise level. Thus, k oc rlogrt to guarantee 
a decent prediction accuracy. 

D. Comparisons with Vivaldi 

Among numerous approaches on network distance pre- 
diction, we consider Vivaldi [9] as the state of the art be- 
cause of its accuracy and its practicability. To the best of 
our knowledge, Vivaldi is the only system that has been 
actually adopted in a real application, namely Azureus (2). 
Other approaches such as GNP (8) and IDES |9| are less 
convenient due to the usage of landmarks, which makes their 
application impossible in the context of passive probing of 
distance measurements (thus impossible to be evaluated on the 
Harvard226 dataset). Due to the insights in Section III-D we 



consider these landmark-based systems as a special variation 
of a generic decentralized model. 

In this paper, we only compare our DMFSGD algorithm 
with Vivaldi. To address the measurement dynamics and the 
skewed neighbor updates, we adopted the Vivaldi implemen- 
tation in [13] |^] when dealing with the Harvard226 dataset. 
The conventional Vivaldi in [9 1 was adopted to deal with the 
other two datasets. We refer to the former as Harvard Vivaldi 
to make the distinction. In addition, despite the impracticality, 
we also demonstrate the flexibility of the DMFSGD algorithm 
in dealing with the landmark-based architecture, referred to 
as DMFSGD Landmark, by forcing each node to only select 

3 For P2PSiml740 and Meridian2500, at any time, the number of mea- 
surements used by each node is statistically the same for all nodes due to 
the random selections of the source and the target nodes in the updates. For 
Harvard226, this number is significantly different for different nodes because 
the paths were passively probed with uneven frequencies 



4 The source code was downloaded from http://www.eecs.harvard.edu/ 
-'syrah/nc/J 
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the Li loss function and the nonnegativity constraint, k is treated as 226 for Harvard226 and k = 32 



the landmarks as neighbors. Note that we only ran DMFSGD 
Landmark on P2PSiml740 and Meridian2500 because the 
dynamic measurements in Harvard226 were obtained passively 
and thus we cannot select landmarks and force each node to 
only communicate with them. To make the comparison fair, 
32 landmarks were randomly selected. 

Figure [8] shows the comparisons between DMFSGD, Vi- 
valdi/Harvard Vivaldi and DMFSGD Landmark. It can be 
seen that on different criteria, our DMFSGD algorithm either 
outperforms or is competitive with Vivaldi. On Harvard226, 
DMFSGD is significantly better on all criteria, especially 
on the MAE where DMFSGD achieved the 1ms MAE, in 
contrast to the 5ms by Harvard Vivaldi, meaning that half 
of the estimated distances have an error of less than 1ms. 
On P2PSiml740, DMFSGD is better on the MAE and the 
cumulative distributions of REE, whereas on Meridian2500, 
DMFSGD achieved similar performance as Vivaldi on all cri- 
teria. Note that DMFSGD and DMFSGD Landmark performed 
similarly on P2PSiml740 and Meridian2500. 

As Harvard226 contains real dynamic data collected from a 
real application, Azureus, the superiority on it shows clearly 
the usability of our DMFSGD algorithm. 

VII. Conclusions and Future Works 

This paper presents a novel approach to network distance 
prediction by low-rank matrix factorization. The success of the 
approach roots both in the exploitation of the dependencies 
across distance measurements between network nodes and in 
the stochastic optimization which enables a fully decentralized 
architecture. A so-called Decentralized Matrix Factorization 
by Stochastic Gradient Descent (DMFSGD) algorithm is pro- 
posed to solve the distance prediction problem. The algorithm 
is simple, with no infrastructure, scalable, able to deal with 



dynamic measurements in large-scale networks, and accurate, 
generally superior to Vivaldi. 

Extensive experiments on various RTT datasets, particularly 
on one with real data from Azureus, demonstrate the poten- 
tial of the algorithm being utilized by Internet applications, 
which we would like to study in the future. Our approach is 
flexible and can easily be extended to other network metrics 
such as available bandwidth (43). An interesting topic is to 
study which metrics are suitable for our matrix completion 
framework. 
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Fig. 8. Comparison of DMFSGD and Vivaldi. The default configuration of A = 1 and r = 10 with r\ adapted by the line search, the L\ loss function and 
the nonnegativity constraint is used in DMFSGD and DMFSGD Landmark. The 10 dimensional Euclidean space with the Height model is used in Vivaldi 
and Harvard Vivaldi. Note that as the implementation of Harvard Vivaldi only outputs the results in the end of the simulation, the final stress and the final 
MAE are plotted as a constant. 



[7] M. J. Freedman, K. Laskhminarayanan, and D. Mazieres, "OASIS: 
Anycast for any service," in POT3rd Symposium on Networked Systems 
Design and Implementation, San Jose, CA, May 2006. 

[8] T. S. E. Ng and H. Zhang, "Predicting Internet network distance with 
coordinates-based approaches," in Proc. of IEEE INFOCOM, New York, 
NY, USA, Jun. 2002. 

[9] F. Dabek, R. Cox, F. Kaashoek, and R. Morris, "Vivaldi: A decentralized 
network coordinate system," in Proc. of ACM SIGCOMM, Portland, OR, 
USA, Aug. 2004. 

[10] Y. Mao, L. Saul, and J. M. Smith, "IDES: An Internet distance 
estimation service for large networks," IEEE Journal On Selected Areas 
in Communications, vol. 24, no. 12, pp. 2273-2284, Dec. 2006. 

[11] Y. Liao, P. Geurts, and G. Leduc, "Network distance prediction based 
on decentralized matrix factorization," in Proc. of IFIP Networking 
Conference, Chennai, India, May 2010. 

[12] B. Donnet, B. Gueye, and M. A. Kaafar, "A survey on network coordi- 
nates systems, design, and security," IEEE Communication Surveys and 
Tutorial, vol. 12, no. 4, pp. 488-503, 2010. 

[13] J. Ledlie, P. Gardner, and M. I. Seltzer, "Network coordinates in the 
wild," in Proc. of USENIX NSDI, Cambridge, Apr. 2007. 

[14] H. Zheng, E. K. Lua, M. Pias, and T. Griffin, "Internet Routing 
Policies and Round- Trip-Times," in Proc. of the Passive and Active 
Measurement, Boston, MA, USA, Apr. 2005. 

[15] S. Lee, Z. Zhang, S. Sahu, and D. Saha, "On suitability of Euclidean 
embedding of Internet hosts," SIGMETRICS, vol. 34, no. 1, pp. 157-168, 
2006. 

[16] G. Wang, B. Zhang, and T. S. E. Ng, "Towards network triangle 
inequality violation aware distributed systems," in Proc. the ACM/IMC 
Conference, San Diego, CA, USA, Oct. 2007, pp. 175-188. 



[17] S. Banerjee, T. G. Griffin, and M. Pias, "The interdomain connectivity 

of PlanetLab nodes," in Proc. of the Passive and Active Measurement, 

Antibes Juan-les-Pins, France, Apr. 2004. 
[18] E. K. Lua, T. Griffin, M. Pias, H. Zheng, and J. Crowcroft, "On the 

accuracy of embeddings for Internet coordinate systems," in Proc the 

IMC Conference. New York, NY, USA: ACM, 2005, pp. 1-14. 
[19] E. J. Candes and B. Recht, "Exact matrix completion via convex 

optimization," Foundations of Computational Mathematics, vol. 9, no. 6, 

pp. 717-772, 2009. 
[20] E. J. Candes and Y. Plan, "Matrix completion with noise," Proc. of the 

IEEE, vol. 98, no. 6, 2010. 
[21] R. H. Keshavan, S. Oh, and A. Montanari, "Matrix completion from a 

few entries," CoRR, vol. abs/0901.3150, 2009. 
[22] N. S. Nati and T. Jaakkola, "Weighted low-rank approximations," in 

International Conference on Machine Learning, 2003, pp. 720-727. 
[23] Z. Wen, W. Yin, and Y. Zhang, "Solving a low-rank factorization 

model for matrix completion by a non-linear successive over-relaxation 

algorithm," Department of Computational and Applied Mathematics, 

Rice University, Tech. Rep. TR10-07, 2010. 
[24] S. Shalev-Shwartz, A. Gonen, and O. Shamir, "Large-Scale Convex 

Minimization with a Low-Rank Constraint," in International Conference 

on Machine Learning, 2011. 
[25] L. Bottou, "Online algorithms and stochastic approximations," in Online 

Learning and Neural Networks, D. Saad, Ed. Cambridge University 

Press, 1998. 

[26] A. Pathak, H. Pucha, Y. Zhang, Y. C. Hu, and Z. M. Mao, "A 
measurement study of Internet delay asymmetry," in Proc. of the Passive 
and Active Measurement, Cleveland, OH, USA, Apr. 2008. 

[27] Y. He, M. Faloutsos, S. Krishnamurthy, B. Huffaker, Y. He, M. Faloutsos, 



LIAO et at: DMFSGD: A DECENTRALIZED MATRIX FACTORIZATION ALGORITHM FOR NETWORK DISTANCE PREDICTION 



14 



[28] 



[29] 



[30] 



[31] 
[32] 



[33] 



[34] 



[35] 
[36] 



[37] 



[38] 



[39] 



[40] 
[411 



[421 
[43] 



S. Rrishnamurthy, and B. Huffaker, "On routing asymmetry in the 
Internet," in Proc. of IEEE Globecom, 2005. 

Y. Chen, X. Wang, X. Song, E. K. Lua, C. Shi, X. Zhao, B. Deng, 
and X. Li, "Phoenix: Towards an accurate, practical and decentralized 
network coordinate system," in Proc. of IFIP Networking Conference, 
Aachen, Germany, May 2009. 

L. Tang and M. Crovella, "Virtual landmarks for the Internet," in Proc. 
of ACM/SIGCOMM Internet Measurement Conference, Oct. 2003. 

B. Wong, A. Slivkins, and E. Sirer, "Meridian: A lightweight network lo- 
cation service without virtual coordinates," in Proc. of ACM SIGCOMM, 
Aug. 2005. 

G. H. Golub and C. F. van Loan, Matrix Computations, 3rd ed. Johns 
Hopkins University Press, 1996. 

A. M. Buchanan and A. W. Fitzgibbon, "Damped newton algorithms for 
matrix factorization with missing data," in Computer Vision and Pattern 
Recognition, 2005. 

D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix 
factorization," in Advances in Neural Information Processing Systems. 
MIT Press, 2001, pp. 556-562. 

G. Takes, I. Pilszy, B. Nmeth, and D. Tikk, "Scalable collaborative fil- 
tering approaches for large recommender systems," Journal of Machine 
Learning Research, vol. 10, pp. 623-656, Jun. 2009. 
D. Bertsekas, Nonlinear programming. Athena Scientific, 1999. 
S. Boyd, A. Ghosh, S. Member, B. Prabhakar, and D. Shah, "Ran- 
domized gossip algorithms," IEEE Transactions on Information Theory, 
vol. 52, pp. 2508-2530, 2006. 

C. Hennig and M. Kutlukaya, "Some thoughts about the design of loss 
functions," REVSTAT-Statistical Journal, vol. 5, no. 1, 2007. 

Q. Ke and T. Kanade, "Robust L\ norm factorization in the presence 
of outliers and missing data by alternative convex programming," in 
Computer Vision and Pattern Recognition, 2005, pp. 592-599. 
C.-J. Lin, "Projected gradient methods for nonnegative matrix factoriza- 
tion," Neural Computation, vol. 19, pp. 27 56-2779, Oct 2007. 

A simulator for peer-to-peer protocols, http://www.pdos.lcs.mit.edu/ 
p2psim/index.html 

K. P. Gummadi, S. Saroiu, and S. D. Gribble, "King: Estimating latency 
between arbitrary Internet end hosts," in Proc. of the ACM/SIGCOMM 
Internet Measurement Workshop, Marseille, France, Nov. 2002. 
matlab mdscale function, http://www.mathworks.de/access/helpdesk/ 
help/toolbox/stats/mdscale.html 

Y. Liao, W. Du, P. Geurts, and G. Leduc, "Decentralized prediction of 
end-to-end network performance classes," in Proc. of CoNEXT, Tokyo, 
Japan, Dec. 2011. 



Yongjun Liao is a PhD student at Department 
of Electrical Engineering and Computer Science, 
University of Liege, Belgium. She received her B.S. 
in 1999 and M.S. in 2002, both from Department of 
Computer Science, Guangxi University, China. Be- 
fore joining the networking group RUN in University 
of Liege in 2007, she had worked as a software engi- 
neer in a small computer company in Beijing, China. 
Her main research interests are the applications of 
machine learning techniques to computer networking 
problems, specifically the prediction of end-to-end 
network performance in large-scale networks. 





Pierre Geurts is an assistant professor in the EECS 
department of the University of Liege, Belgium. 
He graduated as an electrical (computer science) 
engineer in 1998 and received the PhD degree in 
applied sciences in 2002. From 2006 to 2011, he 
was research associate of the FNRS (Belgium). His 
research interests concern the design of, compu- 
tationally and statistically efficient, supervised and 
semi-supervised learning algorithms in order to ex- 
ploit structured input and output spaces (sequences, 
images, time-series, graphs), with applications in 
vision, and computer networks. 



Guy Leduc is a full professor in the EECS de- 
partment of the University of Liege, Belgium, and 
is since 1997 the head of the Research Unit in 
Networking (RUN). He graduated as an electrical 
(electronics) engineer in 1983 and got his PhD in 
computer science in 1991. 

His research field is computer networks, and 
his main research interests are Network Coor- 
dinate Systems, overlays, traffic engineering, re- 
silience, multimedia, congestion control, and auto- 
nomic/active/programmable networks. His research 
unit is or has been involved in European projects such as ECODE on cog- 
nitive networking, ResumeNet on Resilient Networking, ANA on autonomic 
networking, TOTEM on an open-source toolbox for traffic engineering, and 
the E-NEXT European network of excellence. 

Since 2007 he has been the chairman of the IFIP Technical Committee 
(TC6) on Communications Systems. He is an area editor of the Elsevier 
Computer Communications journal, and a steering committee member of the 
IFIP Networking Conference. 





Wei Du is a senior Postdoctoral researcher at Intel- 
ligent and Interactive Systems (IIS), University of 
Innsbruck, Austria. He received his B.S. in 1997 
from Tianjin University, China, and PhD in 2002 
from Institute of Computing Technology, Chinese 
Academy of Sciences, China. Since graduation, he 
has been working as postdoctoral researcher at IN- 
RIA, France, Hamburg University, Germany, Univer- 
sity of Liege, Belgium, and University of Innsbruck, 
^ Austria. His main research interests are computer 
vision and machine learning. 



